Appendix A

In this appendix, we state and prove the correctness of functions entry (i.e., the correctness of our notion of extended SLD resolution) and prop, as defined in the body of the paper.

First, we recall the definition of function entry: 
Definition 1 (entry procedure)

Let $H \leftarrow B_1, \ldots, B_n$ be a clause and $(A, \pi, \mu)$ an extended atom such that $A$ and $H$ unify. We denote with entry a function that propagates $\pi$ and $\mu$ to $B_1, \ldots, B_n$. Formally, $\mathit{entry}(\pi, \mu, (H \leftarrow B_1, \ldots, B_n)) = ((B_1, \pi_1, \mu_1), \ldots, (B_n, \pi_n, \mu_n))$ if, for all $B_i = p_i(t_{i1}, \ldots, t_{im_i})$, $i = 1, \ldots, n$, the following conditions hold:

• $j \in \pi_i$ iff $\mathit{Var}(t_{ij}) \subseteq \mathit{Var}(\pi(H))$ (i.e., all variables in $t_{ij}$ are ground in $H$ according to $\pi$).

• $\{1, \ldots, m_i\} \supseteq \{j_1, \ldots, j_k\} \in \mu_i$ iff there are (not necessarily different) variables $(x_{j_1}, \ldots, x_{j_k}) \in \mathit{Var}(t_{ij_1}) \times \cdots \times \mathit{Var}(t_{ij_k})$ such that for every pair of different variables $x_{j_l}, x_{j_r}$, we have $(x_{j_l}, x_{j_r}) \in \mu(H)$ (i.e., either the terms share some variable or have different variables that are shared in $H$ according to $\mu$).

Now, let us introduce the following notion of safeness, which will be useful for proving the correctness results.

Definition 2 (safeness) 

Let $p(t_1, \ldots, t_n)$ be a run-time call. We say that a groundness call pattern $\pi$ is safe for $p(t_1, \ldots, t_n)$ if $i \in \pi$ implies that $\mathit{Var}(t_i) = \emptyset$. Also, a sharing call pattern $\mu$ is safe for $p(t_1, \ldots, t_n)$ if $\mathit{Var}(t_i) \cap \mathit{Var}(t_j) \neq \emptyset$ for some $i, j \in \{1, \ldots, n\}$, $i \neq j$, implies that $i, j \in s_i$ and $i, j \in s_j$, where $\mu = (s_1, \ldots, s_n)$. This notion is extended to queries in the natural way.

Analogously, let $p(t_1, \ldots, t_n)$ be a run-time call with computed answer substitution $\theta$. We say that a groundness success pattern $\pi'$ is safe for $p(t_1, \ldots, t_n)\theta$ if $i \in \pi'$ implies that $\mathit{Var}(t_i\theta) = \emptyset$. Also, a sharing success pattern $\mu'$ is safe for $p(t_1, \ldots, t_n)\theta$ if $\mathit{Var}(t_i\theta) \cap \mathit{Var}(t_j\theta) \neq \emptyset$ for some $i, j \in \{1, \ldots, n\}$, $i \neq j$, implies that $i, j \in s_i$ and $i, j \in s_j$, where $\mu' = (s_1, \ldots, s_n)$. This notion is also extended to queries in the natural way.

Given an extended atom $(A, \pi, \mu)$ (typically with a partial evaluation call $A$), we say that it is safe if, for all run-time calls $A\theta$, both $\pi$ and $\mu$ are safe for $A\theta$.

Given a partial evaluation call $A$ with call and success groundness (resp. sharing) pattern $\mathit{pred}(A) : \pi \mapsto \pi'$ (resp. $\mathit{pred}(A) : \mu \mapsto \mu'$), we say that this call and success pattern is safe for $A$ if, for all run-time calls $A\sigma$ and computed answer substitutions $\theta$, the fact that $\pi$ (resp. $\mu$) is safe for $A\sigma$ implies that $\pi'$ (resp. $\mu'$) is safe for $A\sigma\theta$.

Now, we recall the notion of extended SLD resolution: 

Definition 3 (extended SLD resolution)

Extended SLD resolution, denoted by $\rightsquigarrow$, is a natural extension of SLD resolution over extended queries. Formally, given a program $P$, an extended query $Q = (A_1, \pi_1, \mu_1), \ldots, (A_n, \pi_n, \mu_n)$, and a computation rule $\mathcal{R}$, we say that $\leftarrow Q \rightsquigarrow_{P,\mathcal{R},\sigma} \leftarrow Q'$ is an extended SLD resolution step for $Q$ with $P$ and $\mathcal{R}$ if the following conditions hold:¹

• $\mathcal{R}(Q) = (A_i, \pi_i, \mu_i)$, $1 \leq i \leq n$, is the selected extended atom,

• $H \leftarrow B_1, \ldots, B_m$ is a renamed apart clause of $P$,

• $A_i$ and $H$ unify with $\sigma = \mathit{mgu}(A_i, H)$, and

• $Q' = \mathit{entry}(\pi_i, \mu_i, (H \leftarrow B_1, \ldots, B_m))\sigma$.²

The next lemma states the correctness of the extended SLD resolution. We only 
consider atomic extended queries, which is enough for our purposes. Moreover, 
information is not propagated between query atoms; this will be the purpose of 
function prop below. 

Lemma 1 

Let $P$ be a program and let $\leftarrow (A, \pi, \mu) \rightsquigarrow_\sigma \leftarrow (B_1\sigma, \pi_1, \mu_1), \ldots, (B_n\sigma, \pi_n, \mu_n)$ be an extended resolution step with clause $H \leftarrow B_1, \ldots, B_n$. If $(A, \pi, \mu)$ is safe, then $(B_i\sigma, \pi_i, \mu_i)$ is safe too, where $\mathcal{R}((B_1\sigma, \pi_1, \mu_1), \ldots, (B_n\sigma, \pi_n, \mu_n)) = (B_i\sigma, \pi_i, \mu_i)$ for some selection strategy $\mathcal{R}$.

Proof 

Consider that $\mathit{entry}(\pi, \mu, (H \leftarrow B_1, \ldots, B_n)) = (B_1, \pi_1, \mu_1), \ldots, (B_n, \pi_n, \mu_n)$. We prove that $(B_i\sigma, \pi_i, \mu_i)$ with $B_i = p_i(t_1, \ldots, t_m)$ is safe by contradiction. For this purpose, we consider a run-time call $A\theta$ such that $\leftarrow A\theta \leadsto_\delta \leftarrow B_1\delta, \ldots, B_n\delta$, where $\delta = \mathit{mgu}(A\theta, H)$ (so $B_i\delta$ is an instance of $B_i\sigma$).

Assume that there exists some $j \in \pi_i$ such that $\mathit{Var}(t_j\delta) \neq \emptyset$. Therefore, there exists some variable $x \in \mathit{Var}(t_j)$ such that $x\delta$ is not ground. By Definition 1, we have that $x \in \mathit{Var}(\pi(H))$. However, since $(A, \pi, \mu)$ is safe, $\pi(A\theta)$ must be ground, and therefore $x\delta$ must be ground after unifying $A\theta$ with $H$ using $\delta = \mathit{mgu}(A\theta, H)$, so that we get a contradiction.

Assume now that $\mu_i = (s_1, \ldots, s_m)$ and that $\mathit{Var}(t_j\delta) \cap \mathit{Var}(t_k\delta) \neq \emptyset$ for some $j \neq k$, but $\{j, k\} \not\subseteq s_j$ or $\{j, k\} \not\subseteq s_k$. Therefore, either (a) there exists some variable $x \in \mathit{Var}(t_j) \cap \mathit{Var}(t_k)$ such that $x\delta$ is not ground, or (b) there are different variables $(x, y) \in \mathit{Var}(t_j) \times \mathit{Var}(t_k)$ such that $\mathit{Var}(x\delta) \cap \mathit{Var}(y\delta) \neq \emptyset$. Consider first case (a). Here, we immediately get a contradiction since, by Definition 1, $j$ and $k$ must belong to the sets $s_j$ and $s_k$. Consider now case (b). Since $(A, \pi, \mu)$ is safe and $x, y$ are bound to terms sharing variables in the run-time call $A\theta$, we have $(x, y) \in \mu(H)$. Therefore, by Definition 1, $j$ and $k$ must belong to both sets $s_j$ and $s_k$, and we get a contradiction too. □

¹ We often omit $P$, $\mathcal{R}$, and/or $\sigma$ in the notation of an extended SLD resolution step when they are clear from the context.

² We let $((B_1, \pi_1, \mu_1), \ldots, (B_n, \pi_n, \mu_n))\sigma = (B_1\sigma, \pi_1, \mu_1), \ldots, (B_n\sigma, \pi_n, \mu_n)$.

Let us now recall the definition of function prop: 

Definition 4 (pattern propagation)

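Let $Q_1, Q_2$ be extended queries, with $Q_1 = (A_1, \pi_1, \mu_1), \ldots, (A_n, \pi_n, \mu_n)$ and $Q_2 = (A_{n+1}, \pi_{n+1}, \mu_{n+1}), \ldots, (A_m, \pi_m, \mu_m)$. We define the function prop to propagate success patterns to the right as follows:³

• $\mathit{prop}(Q_1, Q_2) = Q_2$ if $n = 0$ (i.e., $Q_1$ is an empty query);

• $\mathit{prop}(Q_1, Q_2) = ((A_1, \pi_1, \mu_1), \mathit{prop}(Q_1', Q_2'))$ if $n > 0$, $\mathit{pred}(A_1) : \pi_1 \mapsto \pi_1'$, $\mathit{pred}(A_1) : \mu_1 \mapsto \mu_1'$,
$\mathit{entry}(\pi_1', \mu_1', (A_1 \leftarrow A_2, \ldots, A_m)) = (A_2, \pi_2', \mu_2'), \ldots, (A_m, \pi_m', \mu_m')$,
$Q_1' = (A_2, \pi_2 \sqcap \pi_2', \mu_2 \sqcup \mu_2'), \ldots, (A_n, \pi_n \sqcap \pi_n', \mu_n \sqcup \mu_n')$, and
$Q_2' = (A_{n+1}, \pi_{n+1} \sqcap \pi_{n+1}', \mu_{n+1} \sqcup \mu_{n+1}'), \ldots, (A_m, \pi_m \sqcap \pi_m', \mu_m \sqcup \mu_m')$.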
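A compact Python sketch of Definition 4 follows, assuming that the success-pattern lookup (the mappings $\mathit{pred}(A) : \pi \mapsto \pi'$ and $\mathit{pred}(A) : \mu \mapsto \mu'$ computed by the pre-processing analyses) and function entry are passed in as callables; the parameter names `success` and `entry` and the toy stubs at the end are hypothetical. The sketch only illustrates how success patterns are pushed to the right and combined with $\sqcap$ (set union) and $\sqcup$ (argument-wise union).

```python
def meet(pi1, pi2):
    """Groundness glb of Section 2.1: set union."""
    return set(pi1) | set(pi2)


def lub(mu1, mu2):
    """Sharing lub of Section 2.1: argument-wise union."""
    return tuple(set(a) | set(b) for a, b in zip(mu1, mu2))


def prop(q1, q2, success, entry):
    """Sketch of Definition 4 (pattern propagation).

    q1, q2  : lists of extended atoms (atom, pi, mu)
    success : (atom, pi, mu) -> (pi', mu'), the success patterns of pred(atom)
    entry   : (pi', mu', head, body_atoms) -> one (atom, pi, mu) per body atom
    """
    if not q1:                                    # n = 0: Q1 is empty
        return list(q2)
    (a1, pi1, mu1), rest = q1[0], list(q1[1:])
    pi1s, mu1s = success(a1, pi1, mu1)            # pred(A1): pi1 |-> pi1', mu1 |-> mu1'
    tail = rest + list(q2)                        # A_2, ..., A_m
    # entry(pi1', mu1', (A1 <- A2, ..., Am)): A1 is used as if it were a clause head
    propagated = entry(pi1s, mu1s, a1, [a for a, _, _ in tail])
    combined = [(a, meet(p, ps), lub(m, ms))
                for (a, p, m), (_, ps, ms) in zip(tail, propagated)]
    k = len(rest)                                 # split back into Q1' and Q2'
    return [(a1, pi1, mu1)] + prop(combined[:k], combined[k:], success, entry)


# Hypothetical stubs for unary atoms: every predicate grounds its argument on
# success, and entry marks a body argument ground whenever pi' is non-empty.
def toy_success(atom, pi, mu):
    return {1}, mu


def toy_entry(pi, mu, head, body):
    return [(b, {1} if pi else set(), ({1},)) for b in body]


q1 = [(("p", "X"), set(), ({1},)), (("q", "X"), set(), ({1},))]
print(prop(q1, [], toy_success, toy_entry))
```

With these stubs, the second atom inherits the groundness success pattern {1} of the first one, while the first atom keeps its original call patterns.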

Finally, we prove the correctness of this function: 
Lemma 2 

Let $(A, \pi, \mu)$ be a safe extended atomic query such that $\leftarrow (A, \pi, \mu) \rightsquigarrow_\sigma \leftarrow Q$. Then $\mathit{prop}(Q, \mathit{true})$ is a safe extended query if the considered call and success patterns are safe.

Proof 

We consider that $Q$ has two extended atoms to simplify the proof (the extension to an arbitrary number of atoms can easily be done by induction on the number of atoms). Let $H \leftarrow B_1, B_2$ with $\mathit{entry}(\pi, \mu, (H \leftarrow B_1, B_2)) = (B_1, \pi_1, \mu_1), (B_2, \pi_2, \mu_2)$ and $\sigma = \mathit{mgu}(A, H)$, so that $Q = (B_1\sigma, \pi_1, \mu_1), (B_2\sigma, \pi_2, \mu_2)$. By Lemma 1, we know that both $(B_1\sigma, \pi_1, \mu_1)$ and $(B_2\sigma, \pi_2, \mu_2)$ are safe at clause entry (i.e., if selected first). Since we consider a left-to-right selection rule, only $(B_1\sigma, \pi_1, \mu_1)$ is safe in principle.

Now consider the computation of $\mathit{prop}(((B_1\sigma, \pi_1, \mu_1), (B_2\sigma, \pi_2, \mu_2)), \mathit{true})$. For this purpose, we consider the following safe call and success groundness and sharing patterns: $\mathit{pred}(B_1) : \pi_1 \mapsto \pi_1'$ and $\mathit{pred}(B_1) : \mu_1 \mapsto \mu_1'$. Let $\mathit{entry}(\pi_1', \mu_1', (B_1\sigma \leftarrow B_2\sigma)) = (B_2\sigma, \pi_2', \mu_2')$. Now, we want to prove that

$$\mathit{prop}(((B_1\sigma, \pi_1, \mu_1), (B_2\sigma, \pi_2, \mu_2)), \mathit{true}) = (B_1\sigma, \pi_1, \mu_1), (B_2\sigma, \pi_2 \sqcap \pi_2', \mu_2 \sqcup \mu_2')$$

is a safe query. For this purpose, we only have to prove that $(B_2\sigma, \pi_2 \sqcap \pi_2', \mu_2 \sqcup \mu_2')$ is safe, since $(B_1\sigma, \pi_1, \mu_1)$ has already been proved safe under a left-to-right selection strategy, as mentioned above. Let us consider a run-time call $B_1\sigma\theta$, together with an arbitrary computed answer substitution $\delta$ for $B_1\sigma\theta$, so that $B_2\sigma\theta\delta$ is a run-time call too. We prove the claim by contradiction.

Assume that $B_2\sigma = p(t_1, \ldots, t_n)$ and that there is some $i \in \pi_2 \sqcap \pi_2'$ such that $\mathit{Var}(t_i\theta\delta) \neq \emptyset$. By definition, $i \in \pi_2$ or $i \in \pi_2'$. By Lemma 1, we have that $\pi_2$ is safe at clause entry, so $i \notin \pi_2$ since $\mathit{Var}(t_i\theta) \neq \emptyset$ (if $t_i\theta$ were ground, so would be $t_i\theta\delta$). By an argument similar to that of Lemma 1 (it again requires an application of function entry), we have that $\pi_2'$ is safe when $B_1\sigma\theta$ succeeds, so $i \notin \pi_2'$ either, since $\mathit{Var}(t_i\theta\delta) \neq \emptyset$, and we get a contradiction.

Consider now that $\mathit{Var}(t_j\theta\delta) \cap \mathit{Var}(t_k\theta\delta) \neq \emptyset$ but $\{j, k\} \not\subseteq s_j$ or $\{j, k\} \not\subseteq s_k$, where $\mu_2 \sqcup \mu_2' = (s_1, \ldots, s_n)$. Since $(B_2\sigma, \pi_2, \mu_2)$ is safe at clause entry, we have $\mathit{Var}(t_j\theta) \cap \mathit{Var}(t_k\theta) = \emptyset$ (otherwise $j$ and $k$ would already belong to the corresponding sets of $\mu_2$, and hence of $\mu_2 \sqcup \mu_2'$). Therefore, it must be $\delta$ that introduces some additional sharing. However, by applying an argument similar to that of Lemma 1, we have that $\mu_2'$ is safe too when $B_1\sigma\theta$ succeeds, so $j, k \in s_j'$ and $j, k \in s_k'$, with $\mu_2' = (s_1', \ldots, s_n')$, and, thus, $j, k \in s_j$ and $j, k \in s_k$, which contradicts our previous assumption. □

³ Note the non-standard use of function entry to propagate success patterns to the right, despite the fact that $A_1 \leftarrow A_2, \ldots, A_m$ is not really a program clause.

Finally, the correctness of function partition is an easy consequence of Lemma 2. Of course, correctness is only ensured when $Q_2$ and $Q_3$ only contain user-defined predicates or "safe" built-ins (i.e., built-ins without side effects, which neither depend on nor may change the order of evaluation, etc.).
Abstract

Traditional approaches to automatic AND-parallelization of logic programs rely on some static analysis to identify independent goals that can be safely and efficiently run in parallel in any possible execution. In this paper, we present a novel technique for generating annotations for independent AND-parallelism that is based on partial evaluation. Basically, we augment a simple partial evaluation procedure with (run-time) groundness and variable sharing information so that parallel conjunctions are added to the residual clauses when the conditions for independence are met. In contrast to previous approaches, our partial evaluator is able to transform the source program in order to expose more opportunities for parallelism. To the best of our knowledge, we present the first approach to a parallelizing partial evaluator.

To appear in Theory and Practice of Logic Programming. 



KEYWORDS: partial evaluation, automatic parallelization, program analysis 



1 Introduction

With the widespread adoption of multi-core processors, the development of automatic parallelizing compilers has become an urgent need. On the other hand, there exist a number of program optimization techniques (like partial evaluation (Jones et al. 1993)) that have not considered the introduction of parallelism so far, thus limiting their potential for improving program performance.

In this work, we tackle the definition of a parallelizing partial evaluator which is able to automatically generate annotations for independent AND-parallelism from logic programs. In contrast to traditional approaches to automatic AND-parallelization of logic programs (which rely on some static analyses to identify independent goals that can be safely and efficiently run in parallel in any possible execution), our approach combines both run-time analyses and the dynamic information gathered during partial evaluation. Furthermore, it allows us to transform the source program in order to expose more opportunities for parallelism (e.g., we can have different specializations of a given clause so that some of them are parallelized and some are not, without adding run-time conditions).

* This work has been partially supported by the Spanish Ministerio de Economia y Competitividad (Secretaria de Estado de Investigacion, Desarrollo e Innovacion) under grant TIN2008-06622-C03-02 and by the Generalitat Valenciana under grant PROMETEO/2011/052.

Partial evaluation. Partial evaluation (Jones et al. 1993) is a well-known technique for program specialization. From a broader perspective, some partial evaluators are also able to optimize programs further by, e.g., shortening computations, removing unnecessary data structures and composing several procedures or functions into a comprehensive definition. Within this broader approach, given a program and a partial (incomplete) call, the essential components of partial evaluation are: the construction of a finite representation (generally a graph) of the possible executions of (any instance of) the partial call, followed by the systematic extraction of a residual program (i.e., the partially evaluated program) from this graph. Intuitively, optimization can be achieved by compressing paths in the graph, by deleting infeasible paths, and by renaming expressions while removing unnecessary function symbols. In this paper, we propose a novel source of optimization based on transforming some sequential constructions of residual programs into parallel ones.

The theoretical foundations of partial evaluation for (normal) logic programs were first put on a solid basis by Lloyd and Shepherdson (1991). When pure logic programs are considered, the term partial deduction is often used. Roughly speaking, in order to compute the partial deduction of a logic program $P$ w.r.t. a set of atoms $\mathcal{A} = \{A_1, \ldots, A_n\}$, one should construct finite (possibly incomplete) SLD trees for the atomic goals $\leftarrow A_1, \ldots, \leftarrow A_n$, such that every leaf is either successful, a failure, or only contains atoms that are instances of $\{A_1, \ldots, A_n\}$; this is the so-called closedness condition (Lloyd and Shepherdson 1991). The residual program then includes a resultant of the form $A_i\sigma \leftarrow Q$ for every non-failing root-to-leaf derivation $\leftarrow A_i \leadsto^*_\sigma \leftarrow Q$ in the SLD trees. Similarly, we say that a residual program $P'$ is closed when every atom in the body of the clauses of $P'$ is an instance of a partially evaluated atom (i.e., an appropriate specialized definition exists).

From an algorithmic perspective, in order to partially evaluate a program $P$ w.r.t. an atom $A$, one starts with the initial set $\mathcal{A}_1 = \{A\}$ and builds a finite (possibly incomplete) SLD tree for $\leftarrow A$. Then, all atoms in the leaves of this SLD tree which are not instances of $A$ are added to the set, thus obtaining $\mathcal{A}_2$, and so forth. In order to keep the sequence $\mathcal{A}_1, \mathcal{A}_2, \ldots$ finite, some generalization is often required, e.g., by replacing some predicate arguments with fresh variables. Some variant of the homeomorphic embedding ordering (Leuschel 2002) is often used to detect potential sources of non-termination.

A sketch of this algorithm is shown in Figure 1, where the unfolding rule $\mathit{unf}(\mathcal{A}_i)$ builds finite SLD trees for the atoms in $\mathcal{A}_i$ and returns the associated resultants, function atoms returns the atoms in the bodies of these resultants, and the abstraction operator $\mathit{abs}(\mathcal{A}_i, \mathcal{A}')$ returns an approximation of $\mathcal{A}_i \cup \mathcal{A}'$ so that the sequence $\mathcal{A}_1, \mathcal{A}_2, \ldots$ is kept finite.
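The loop of Figure 1 can be sketched in Python as a generic fixpoint computation, under the assumption that the unfolding rule, the function atoms, and the abstraction operator are supplied as plain functions (the parameter names `unf`, `atoms` and `abs_` are hypothetical). Set equality stands in for the variant test, and the `max_iter` guard is our own addition; the toy call-graph instantiation only exercises the control flow and performs no real SLD unfolding.

```python
def partial_evaluate(a, unf, atoms, abs_, max_iter=1000):
    """Generic loop of Figure 1 (a sketch).

    unf(S)      -> list of resultants for the atoms in S (the unfolding rule)
    atoms(R)    -> atoms in the bodies of the resultants R
    abs_(S, N)  -> approximation of S | N that keeps the sequence finite
    Frozenset equality stands in for the "variants" test of Figure 1.
    """
    current = frozenset([a])                                # A_1 := {A}
    for _ in range(max_iter):                               # guard, not part of Figure 1
        nxt = frozenset(abs_(current, atoms(unf(current))))  # A_{i+1}
        if nxt == current:                                  # A_{i+1} ~ A_i: fixpoint
            return unf(current)                             # Return unf(A_i)
        current = nxt
    raise RuntimeError("abs_ did not reach a fixpoint within max_iter steps")


# Toy instantiation (hypothetical): atoms are predicate names, and unfolding a
# name simply returns the resultant "head <- body" from a fixed call graph.
CALLS = {"main": ["p", "q"], "p": ["q"], "q": []}


def unf(names):
    return [(h, CALLS[h]) for h in sorted(names)]


def atoms(resultants):
    return [b for _, body in resultants for b in body]


def abs_(names, new):
    return names | frozenset(new)   # finitely many names, so no generalization needed


residual = partial_evaluate("main", unf, atoms, abs_)
print(residual)   # [('main', ['p', 'q']), ('p', ['q']), ('q', [])]
```

In a real partial deducer, `abs_` is where generalization (e.g., guided by homeomorphic embedding) would guarantee termination; here the finite call graph makes plain union sufficient.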

Initialization: $i := 1$; $\mathcal{A}_1 := \{A\}$;
Repeat
    $\mathcal{A}_{i+1} := \mathit{abs}(\mathcal{A}_i, \mathit{atoms}(\mathit{unf}(\mathcal{A}_i)))$;
    $i := i + 1$
Until $\mathcal{A}_i \approx \mathcal{A}_{i-1}$ (variants)
Return $\mathit{unf}(\mathcal{A}_i)$

Fig. 1. Partial evaluation procedure

Motivation. Depending on when control issues (like deciding which atoms should or should not be unfolded) are addressed, two main approaches to partial evaluation can be distinguished. In offline approaches to partial evaluation, these decisions are taken beforehand by means of a so-called binding-time analysis (where we know which parameters are known but not their values). In contrast, online partial evaluators take decisions on the fly (so that actual values of static data are available).

While offline partial evaluators are usually faster, online ones produce more accurate results. Partial evaluators for logic programs have mostly followed the online approach (e.g., SAGE (Gurr 1994), Mixtus (Sahlin 1990), SP (Gallagher 1991), ECCE (Leuschel et al. 2006)), though some offline partial evaluators have also been developed (e.g., LOGEN (Leuschel et al. 2006)).

Recently, we have proposed in (Vidal 2011) a hybrid approach to partial evaluation that fits well in neither the offline nor the online style of partial evaluation. Basically, we follow a typical online partial evaluation scheme, but augment it with run-time information gathered from a pre-processing static analysis. There are some previous approaches that combine the online and offline styles of partial evaluation. However, the novelty of (Vidal 2011) is that it collects, in a pre-processing stage, run-time information rather than partial evaluation time information (as binding-time analyses do).

In this paper, we want to push this approach forward by defining a parallelizing 
partial evaluator that generates annotations for independent AND-parallelism. 

As is well known, two goals $(G_1, G_2)\theta$ are strictly independent if, for every pair of variables $(x, y) \in \mathit{Var}(G_1) \times \mathit{Var}(G_2)$, either (i) they are equal, $x = y$, and $x\theta$ is ground (i.e., $\mathit{Var}(x\theta) = \emptyset$), or (ii) they are different, $x \neq y$, and their values, $x\theta$ and $y\theta$, do not share a common variable (i.e., $\mathit{Var}(x\theta) \cap \mathit{Var}(y\theta) = \emptyset$).
In order to have this information available at partial evaluation time, we need some 
run-time information that is not usually present in partial evaluation schemes. For 
this purpose, we introduce a hybrid partial evaluation scheme with the following 
features: 

• First, a pre-processing stage performs both a groundness and a sharing analysis, so that we get call and success patterns for each predicate.

• Then, we apply a rather simple partial evaluation stage that only performs 
one-step unfolding. This is very limited in general and propagates almost no 
information. However, in our context, we do not aim at aggressively propa- 
gating static data but only groundness and sharing information. In this way, 
the potential for generating annotations for the implicit independent AND- 
parallelism can be better evaluated. 

• Finally, a post-processing stage extracts residual rules from the partial evaluation computations and, in some cases, replaces sequential conjunctions by parallel ones, thus boosting the performance of the residual program.

A proof-of-concept implementation of the parallelizing partial evaluator is available at http://kaz.dsic.upv.es/litep.html. Despite its simplicity (a thousand lines of Prolog code), the results for definite logic programs (including some built-ins) are very encouraging.

The paper is organized as follows. Section 2 presents the different stages of our parallelizing partial evaluation scheme. Then, Section 3 summarizes our findings from an experimental evaluation of the new technique and, finally, Section 4 concludes and discusses some possibilities for future work. Correctness results can be found in the online appendix.

2 Parallelizing Partial Evaluation 

In this section, we present our partial evaluation scheme in a stepwise manner. We 
do so for clarity of presentation but these stages can be interleaved (and actually 
they are in our implementation). 

2.1 Pre-Processing Stage

Our pre-processing stage consists of two different analyses. The first one is a simple call and success pattern analysis that resembles a mode analysis. The formal definition of the analysis can be found elsewhere (e.g., in (Leuschel and Vidal 2009)).

We consider groundness call and success patterns $\pi$ denoted by a list of natural numbers which represent the (definitely) ground arguments of a predicate. The underlying abstract domain is thus very simple: {definitely ground, possibly non-ground}. As mentioned in (Leuschel and Vidal 2009), the analysis could be made more precise by considering a richer abstract domain (including elements like list, nonvar, etc.). This is orthogonal to the topics of this paper and thus we keep the two-element domain for simplicity. The greatest lower bound operator $\sqcap$ on patterns is defined in the natural way by set union, i.e., given two patterns $\pi_1, \pi_2$ for predicate $p/n$, we let $\pi_1 \sqcap \pi_2 = \pi_1 \cup \pi_2$.

Basically, given an initial query and the groundness call patterns for the atoms in this query, the analysis infers for every predicate $p/n$ a number of call and success patterns of the form $p/n : \pi_{in} \mapsto \pi_{out}$ such that $\pi_{in}$ and $\pi_{out}$ are subsets of $\{1, \ldots, n\}$, where $\pi_{out}$ denotes the arguments of $p/n$ which are definitely ground after a successful derivation, assuming that it is called with ground arguments $\pi_{in}$. The analysis is started with a number of entry points to the program, together with their initial groundness call patterns.

Example 1 

Consider the well-known definition of append/3:

append([], Y, Y).
append([H|T], Y, [H|TY]) :- append(T, Y, TY).
Given the initial groundness call patterns $\pi_1 = \{1\}$ and $\pi_2 = \{1, 2\}$ for append/3, the call and success pattern analysis would return the following mappings:

$\mathit{append}/3 : \{1\} \mapsto \{1\}$        $\mathit{append}/3 : \{1, 2\} \mapsto \{1, 2, 3\}$

Their meaning should be clear: if $\mathit{append}(t_1, t_2, t_3)$ is called with $t_1$ ground, we can only ensure that $t_1$ will be ground after a successful derivation. In contrast, if it is called with both $t_1$ and $t_2$ ground, then $t_3$ will also be ground after a successful derivation.

For guaranteeing the independence of goals, we also consider the information gathered by a dependency analysis like that of (Debray 1989). Basically, for a given predicate $p/3$, the analysis computes mappings with sharing call and success patterns $\mu$ like, e.g., $(\{1, 2\}, \{1, 2, 3\}, \{2, 3\})$, which indicates that the first argument may share variables with the second argument, the second argument may share variables with the first and third arguments, and the third argument may share variables with the second argument. Again, the analysis infers for every predicate $p/n$ a number of call and success patterns of the form $p/n : \mu_{in} \mapsto \mu_{out}$ such that $\mu_{in}$ and $\mu_{out}$ belong to the domain $\wp(\{1, \ldots, n\}) \times \cdots \times \wp(\{1, \ldots, n\})$ (a tuple of $n$ sets) and $\mu_{out}$ denotes the dependencies of $p/n$ which hold after a successful derivation, assuming that it is called with the dependencies denoted by $\mu_{in}$.

In this case, the least upper bound operator $\sqcup$ on sharing patterns is defined as follows: given patterns $\mu = (\vartheta_1, \ldots, \vartheta_n)$ and $\mu' = (\vartheta'_1, \ldots, \vartheta'_n)$ for some predicate $p/n$, we have $\mu \sqcup \mu' = (\vartheta_1 \cup \vartheta'_1, \ldots, \vartheta_n \cup \vartheta'_n)$. Note that, in contrast to the greatest lower bound on groundness patterns, which may increase the number of ground variables (and thus the accuracy of the result), the least upper bound on sharing patterns may lose accuracy since more dependencies can be obtained.
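The two lattice operations and the reading of a sharing tuple can be illustrated with the following Python sketch. Representing patterns as Python sets and tuples of sets is our own choice, and `may_share` is a hypothetical helper that decodes a pattern like $(\{1, 2\}, \{1, 2, 3\}, \{2, 3\})$ into the unordered pairs of argument positions that may share variables, treating the relation as symmetric.

```python
from itertools import combinations


def glb_groundness(pi1, pi2):
    """Greatest lower bound on groundness patterns: set union of ground positions."""
    return set(pi1) | set(pi2)


def lub_sharing(mu1, mu2):
    """Least upper bound on sharing patterns: argument-wise union of the n sets."""
    if len(mu1) != len(mu2):
        raise ValueError("sharing patterns must belong to the same predicate p/n")
    return tuple(set(s1) | set(s2) for s1, s2 in zip(mu1, mu2))


def may_share(mu):
    """Unordered pairs (i, j), i < j, of argument positions that may share variables."""
    n = len(mu)
    return {(i, j) for i, j in combinations(range(1, n + 1), 2)
            if j in mu[i - 1] or i in mu[j - 1]}


print(sorted(may_share(({1, 2}, {1, 2, 3}, {2, 3}))))       # [(1, 2), (2, 3)]
print(glb_groundness({1}, {1, 2}))                           # {1, 2}
print(lub_sharing(({1}, {2}, {3}), ({1, 2}, {1, 2}, {3})))   # ({1, 2}, {1, 2}, {3})
```

The last line shows the loss of accuracy mentioned above: joining the independent pattern with the second one forces us to assume that the first two arguments may share.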

Example 2 

Consider again append/3. Given the sharing call patterns $\mu_1 = (\{1\}, \{2\}, \{3\})$ and $\mu_2 = (\{1, 2\}, \{1, 2\}, \{3\})$, the dependency analysis would return the following:

$\mathit{append}/3 : (\{1\}, \{2\}, \{3\}) \mapsto (\{1, 3\}, \{2, 3\}, \{1, 2, 3\})$

$\mathit{append}/3 : (\{1, 2\}, \{1, 2\}, \{3\}) \mapsto (\{1, 2, 3\}, \{1, 2, 3\}, \{1, 2, 3\})$

Here, we consider two possibilities: first, if append is called with three independent arguments then, after a successful derivation, the third argument may be bound to a value that shares variables with either the first or the second argument; on the other hand, if append is called with the first two arguments bound to terms containing shared variables, then all three arguments may depend on each other after a successful derivation.



We note that a sharing pattern like $(\{1\}, \{2\}, \{3\})$ assumes that all three arguments are independent and, moreover, that no variable sharing can be introduced through a single argument; i.e., we assume that predicate arguments are always linear. We keep this restriction for simplicity, but it could easily be overcome.
2.2 Partial Evaluation Stage

Now, we present the proper partial evaluation stage of the parallelizing partial 
evaluator. 

In principle, one could consider checking independence of goals using the information available solely at partial evaluation time. This approach, however, would be generally incorrect for a number of reasons. First, the notion of closedness (see Sect. 1) allows run-time atoms to be covered by instances of partial evaluation atoms. For instance, $q(X, X)$ is closed w.r.t. $q(X, Y)$. This means that goals can be independent at partial evaluation time but need not be independent at run-time. Moreover, whenever we split a goal of an incomplete computation into atomic subgoals, we are also losing some context information that might be essential for checking independence, as the following example illustrates:

Example 3 

Consider the following program:

p(X, Y) :- q(X), r(Y).
eq(X, X).

Given the goal $\mathit{eq}(A, B), p(A, B)$, if we split it into its atomic subgoals $\mathit{eq}(A, B)$ and $p(A, B)$, and partially evaluate them independently, we could derive the goal $q(A), r(B)$ and incorrectly assume that $q(A)$ and $r(B)$ are independent (at run-time, $\mathit{eq}(A, B)$ binds $A$ and $B$ to the same variable).

Furthermore, the use of an abstraction operator might also involve the loss of some dependencies (e.g., generalizing $p(X, Y, f(Y))$ to $p(X, Y, Z)$ with $Z$ a fresh variable).

In summary, the information available at partial evaluation time is not enough to determine the run-time independence of a goal.¹ Therefore, as mentioned before, in this paper we consider that the partial evaluator includes a pre-processing stage where run-time groundness and sharing information is gathered.

In particular, we design a rather simple partial evaluator with the following dis- 
tinguishing features: 

• only one-step unfolding of atomic goals is performed; 

• no static data are provided (i.e., the initial goal has different variables as arguments);

• every atomic goal is enriched with groundness and sharing call patterns that 
are propagated through partial evaluation. 

The fact that we do not consider partially instantiated initial goals, together with 
the fact that only one-step unfolding is performed, allows us to better identify the 
potential for generating annotations for independent AND-parallelism. Moreover, it 
makes the online partial evaluator scale up better to medium and large applications. 

Our partial evaluator deals with sets of extended atoms (instead of sets of atoms, as in the algorithm of Figure 1).



¹ Of course, we could avoid splitting goals, not use an abstraction operator, and only allow variants to be closed, but then the termination of partial evaluation could not be ensured.
Definition 1 (extended atom)

We consider extended atoms of the form $(A, \pi, \mu)$ where $A$ is an atom, $\pi$ is a groundness call pattern for $A$, and $\mu$ is a sharing call pattern for $A$. This notion is extended in the natural way to queries and goals. We denote the empty extended query by true.

Given an extended query $Q$, we introduce the following auxiliary function: $\mathit{query}(Q) = A_1, \ldots, A_n$ if $Q = (A_1, \pi_1, \mu_1), \ldots, (A_n, \pi_n, \mu_n)$.



The number of different specialized versions of an atom will be determined not only by its shape (as is usually the case), but also by the different combinations of groundness and sharing call patterns. For instance, $(p(X, Y), \{1, 2\}, (\{1\}, \{2\}))$ and $(p(X, Y), \{1\}, (\{1\}, \{2\}))$ would give rise to different specialized versions.

Another distinguishing feature of our scheme is that, in contrast to previous approaches, we do not explicitly distinguish between the so-called local and global levels (as in (Gallagher 1993)). Rather, we construct a single partial evaluation tree that comprises both levels. Moreover, our partial evaluation process performs just one pass, since residual rules can be produced immediately after every unfolding step (rather than in a post-process, as is often done because the unfolding tree can be modified during the partial evaluation process).

In the following, we denote by $\pi(A)$ the (definitely) ground arguments of $A$ according to $\pi$, i.e., $\pi(p(s_1, \ldots, s_n)) = \{s_j \mid j \in \pi\}$. Also, we denote by $\mu(A)$ the set of (possibly) shared variables in $A$ according to $\mu$, i.e., $\mu(p(s_1, \ldots, s_n)) = \{(x, y) \in \mathit{Var}(s_i) \times \mathit{Var}(s_j) \mid i, j \in s \text{ for some } s \in \mu\}$. Before introducing the notion of SLD resolution over extended queries, we need the following preparatory definition, which is used to propagate groundness and sharing call patterns to the atoms in the body of a clause.



Definition 2 (entry procedure)

Let $H \leftarrow B_1, \ldots, B_n$ be a clause and $(A, \pi, \mu)$ an extended atom such that $A$ and $H$ unify. We denote with entry a function that propagates $\pi$ and $\mu$ to $B_1, \ldots, B_n$. Formally, $\mathit{entry}(\pi, \mu, (H \leftarrow B_1, \ldots, B_n)) = ((B_1, \pi_1, \mu_1), \ldots, (B_n, \pi_n, \mu_n))$ if, for all $B_i = p_i(t_{i1}, \ldots, t_{im_i})$, $i = 1, \ldots, n$, the following conditions hold:

• $j \in \pi_i$ iff $\mathit{Var}(t_{ij}) \subseteq \mathit{Var}(\pi(H))$ (i.e., all variables in $t_{ij}$ are ground in $H$ according to $\pi$).

• $\{1, \ldots, m_i\} \supseteq \{j_1, \ldots, j_k\} \in \mu_i$ iff there are (not necessarily different) variables $(x_{j_1}, \ldots, x_{j_k}) \in \mathit{Var}(t_{ij_1}) \times \cdots \times \mathit{Var}(t_{ij_k})$ such that for every pair of different variables $x_{j_l}, x_{j_r}$, we have $(x_{j_l}, x_{j_r}) \in \mu(H)$ (i.e., either the terms share some variable or have different variables that are shared in $H$ according to $\mu$).

Note that the entry procedure is independent of $A$ (only its associated groundness and sharing call patterns matter), since we want the results for a partial evaluation time atom $A$ to be valid for every run-time atom $A\theta$.
Example 4

Let us consider the following program for computing Fibonacci numbers: 
fibonacci(0, 1).                                              % (C1)
fibonacci(1, 1).                                              % (C2)
fibonacci(M, N) :- M > 1, M1 is M - 1, fibonacci(M1, N1),     % (C3)
                   M2 is M - 2, fibonacci(M2, N2), N is N1 + N2.

Here, $\mathit{entry}(\{1\}, (\{1\}, \{2\}), C_3)$ returns the following extended query:

(M > 1, {1, 2}, ({1}, {2})),
(M1 is M - 1, {2}, ({1}, {2})),
(fibonacci(M1, N1), {}, ({1}, {2})),
(M2 is M - 2, {2}, ({1}, {2})),
(fibonacci(M2, N2), {}, ({1}, {2})),
(N is N1 + N2, {}, ({1}, {2}))

We are now ready to introduce the notion of extended SLD resolution: 
Definition 3 (extended SLD resolution)

Extended SLD resolution, denoted by $\rightsquigarrow$, is a natural extension of SLD resolution over extended queries. Formally, given a program $P$, an extended query $Q = (A_1, \pi_1, \mu_1), \ldots, (A_n, \pi_n, \mu_n)$, and a computation rule $\mathcal{R}$, we say that $\leftarrow Q \rightsquigarrow_{P,\mathcal{R},\sigma} \leftarrow Q'$ is an extended SLD resolution step for $Q$ with $P$ and $\mathcal{R}$ if the following conditions hold:²

• $\mathcal{R}(Q) = (A_i, \pi_i, \mu_i)$, $1 \leq i \leq n$, is the selected extended atom,

• $H \leftarrow B_1, \ldots, B_m$ is a renamed apart clause of $P$,

• $A_i$ and $H$ unify with $\sigma = \mathit{mgu}(A_i, H)$, and

• $Q' = \mathit{entry}(\pi_i, \mu_i, (H \leftarrow B_1, \ldots, B_m))\sigma$.³

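Trivially, extended SLD resolution is a conservative extension of SLD resolution: given extended queries $Q, Q'$, we have that $\leftarrow Q \rightsquigarrow_\sigma \leftarrow Q'$ implies $\leftarrow \mathit{query}(Q) \leadsto_\sigma \leftarrow \mathit{query}(Q')$.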

In the following, we use pred{A) to denote the predicate symbol of atom A. 

As is common practice, we avoid infinite unfolding by means of a well-known strategy based on the homeomorphic embedding ordering. Intuitively, we say that atom $A_i$ embeds atom $A_j$, denoted by $A_i \trianglerighteq A_j$, when $A_j$ can be obtained from $A_i$ by deleting symbols (see (Leuschel 2002) for a precise definition).

Definition 4 (variant, embedding)

We say that two (extended) atoms $(A, \pi, \mu)$ and $(A', \pi', \mu')$ are variants, denoted by $(A, \pi, \mu) \approx (A', \pi', \mu')$, if there is a renaming substitution $\rho$ such that $A\rho = A'$, $\pi = \pi'$, and $\mu = \mu'$.

We say that $(A, \pi, \mu)$ embeds $(A', \pi', \mu')$, denoted by $(A, \pi, \mu) \trianglerighteq (A', \pi', \mu')$, if $A \trianglerighteq A'$, $\pi = \pi'$, and $\mu = \mu'$.

² We often omit $P$, $\mathcal{R}$, and/or $\sigma$ in the notation of an extended SLD resolution step when they are clear from the context.

³ We let $((B_1, \pi_1, \mu_1), \ldots, (B_n, \pi_n, \mu_n))\sigma = (B_1\sigma, \pi_1, \mu_1), \ldots, (B_n\sigma, \pi_n, \mu_n)$.
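(variant)

$$\frac{\exists (A',\pi',\mu') \in \mathit{memo}.\ (A,\pi,\mu) \approx (A',\pi',\mu')}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{v} \langle Q;\ \mathit{memo} \rangle}$$

(failure)

$$\frac{\nexists Q'.\ (A,\pi,\mu) \leadsto Q'}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{f} \langle Q;\ \mathit{memo} \rangle}$$

(embedding)

$$\frac{\exists (A',\pi',\mu') \in \mathit{memo}.\ (A,\pi,\mu) \trianglerighteq (A',\pi',\mu')}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{e} \langle Q;\ \mathit{memo} \rangle}$$

(nonuser)

$$\frac{\mathit{pred}(A) \text{ is not defined in the user's program clauses}}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{n} \langle Q;\ \mathit{memo} \rangle}$$

(parallel)

$$\frac{(A,\pi,\mu) \leadsto_{\sigma} Q' \ \wedge\ \exists (Q_1,Q_2,Q_3,Q_4) \in \mathit{partition}_{(A,\pi,\mu)}(Q')}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{p}_{\sigma} \langle Q_1, Q_2, Q_3, Q_4, Q;\ \mathit{memo} \cup \{(A,\pi,\mu)\} \rangle}$$

(unfolding)

$$\frac{(A,\pi,\mu) \leadsto_{\sigma} Q' \ \wedge\ \nexists (Q_1,Q_2,Q_3,Q_4) \in \mathit{partition}_{(A,\pi,\mu)}(Q')}{\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \xrightarrow{u}_{\sigma} \langle \mathit{prop}(Q', \mathit{true}), Q;\ \mathit{memo} \cup \{(A,\pi,\mu)\} \rangle}$$

Fig. 2. Partial evaluation semantics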



Our partial evaluation semantics is formalized by means of the (labelled) state transition system shown in Figure 2. The partial evaluator deals with states, defined as follows:

Definition 5 (state)

A state is a pair of the form $\langle Q;\ \mathit{memo} \rangle$ where $Q$ is a sequence of extended atoms⁶ and $\mathit{memo}$ is a set of extended atoms (the atoms already partially evaluated, which are recorded to guarantee termination).

An initial state has the form $\langle (A,\pi,\mu);\ \{\} \rangle$. A final state has the form $\langle \epsilon;\ \mathit{memo} \rangle$, where $\epsilon$ denotes an empty sequence.

A successful partial evaluation starts with an initial state and (non-deterministically, because of the unfolding rule) constructs a number of derivations of the form $\langle (A,\pi,\mu);\ \{\} \rangle \longrightarrow^* \langle \epsilon;\ \_ \rangle$, where $\longrightarrow^*$ denotes the reflexive and transitive closure of $\longrightarrow$. The process does not return anything but the trace itself, which will be used for producing residual rules (see Section 2.3).

Let us now explain the rules of the partial evaluation semantics. Rule (variant) dis- 
cards an extended atom if it is a variant of an already partially evaluated extended 
atom. Rule (failure) also discards an extended atom when it cannot be unfolded 
(e.g., when A does not unify with the head of any clause). 

The next rule, (embedding), discards an extended atom when it embeds a previously partially evaluated extended atom. This rule is necessary in order to ensure that partial evaluation always terminates. Rule (nonuser) allows us to deal with built-ins and other extra-logical features of Prolog by leaving calls to the original predicates, as we will see in the next section.



⁶ Note that this sequence is not an extended query. Rather, this is the queue of (extended) atomic goals to be partially evaluated.
The interesting rules are (parallel) and (unfolding). In the following, we assume a fixed left-to-right selection rule as in Prolog. Therefore, we use a function prop to propagate groundness and sharing success patterns to the atoms to the right of a given atom before splitting an extended query. This is necessary because only the partial evaluation of atomic goals is allowed; thus, this information should be propagated before the query is split into its constituents in order to avoid a serious loss of accuracy.

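Definition 6 (pattern propagation)

Let $Q_1, Q_2$ be extended queries, with $Q_1 = (A_1,\pi_1,\mu_1), \ldots, (A_n,\pi_n,\mu_n)$ and $Q_2 = (A_{n+1},\pi_{n+1},\mu_{n+1}), \ldots, (A_m,\pi_m,\mu_m)$. We define the function $\mathit{prop}$ to propagate success patterns to the right as follows:

• $\mathit{prop}(Q_1, Q_2) = Q_2$ if $n = 0$ (i.e., $Q_1$ is an empty query);

• $\mathit{prop}(Q_1, Q_2) = ((A_1,\pi_1,\mu_1), \mathit{prop}(Q_1', Q_2'))$ if $n > 0$, where⁷

$$\begin{array}{l}
\mathit{pred}(A_1) : \pi_1 \mapsto \pi_1', \quad \mathit{pred}(A_1) : \mu_1 \mapsto \mu_1',\\
\mathit{entry}(\pi_1', \mu_1', (A_1 \leftarrow A_2, \ldots, A_m)) = (A_2,\pi_2',\mu_2'), \ldots, (A_m,\pi_m',\mu_m'),\\
Q_1' = (A_2, \pi_2 \cup \pi_2', \mu_2 \cup \mu_2'), \ldots, (A_n, \pi_n \cup \pi_n', \mu_n \cup \mu_n'), \text{ and}\\
Q_2' = (A_{n+1}, \pi_{n+1} \cup \pi_{n+1}', \mu_{n+1} \cup \mu_{n+1}'), \ldots, (A_m, \pi_m \cup \pi_m', \mu_m \cup \mu_m').
\end{array}$$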

Observe that the two arguments of function prop are not needed for unfolding a 
goal. However, this formulation will become useful later when also using prop to 
partition a goal. 

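Example 5

Consider again the Fibonacci program of Example 4 and the result of the entry procedure. Thus we have

$$\begin{array}{l}
(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\})) \leadsto_{\{A \mapsto M,\, B \mapsto N\}}\\
\quad (M > 1, \{1,2\}, (\{1\},\{2\})), (M1\ \mathsf{is}\ M - 1, \{2\}, (\{1\},\{2\})), (\mathit{fibonacci}(M1,N1), \{\}, (\{1\},\{2\})),\\
\quad (M2\ \mathsf{is}\ M - 2, \{2\}, (\{1\},\{2\})), (\mathit{fibonacci}(M2,N2), \{\}, (\{1\},\{2\})), (N\ \mathsf{is}\ N1 + N2, \{\}, (\{1\},\{2\})).
\end{array}$$

We assume the following call and success patterns:

$$\begin{array}{ll}
\mathit{is}/2 : \{2\} \mapsto \{1,2\} & \mathit{is}/2 : (\{1\},\{2\}) \mapsto (\{1\},\{2\})\\
\mathit{fibonacci}/2 : \{1\} \mapsto \{1,2\} & \mathit{fibonacci}/2 : (\{1\},\{2\}) \mapsto (\{1\},\{2\})
\end{array}$$

Then, for instance, we have

$$\begin{array}{l}
\mathit{prop}(((M > 1, \{1,2\}, (\{1\},\{2\})), (M1\ \mathsf{is}\ M - 1, \{2\}, (\{1\},\{2\})),\\
\quad (\mathit{fibonacci}(M1,N1), \{\}, (\{1\},\{2\})), (M2\ \mathsf{is}\ M - 2, \{2\}, (\{1\},\{2\})),\\
\quad (\mathit{fibonacci}(M2,N2), \{\}, (\{1\},\{2\})), (N\ \mathsf{is}\ N1 + N2, \{\}, (\{1\},\{2\}))), \mathit{true})\\
= ((M > 1, \{1,2\}, (\{1\},\{2\})), (M1\ \mathsf{is}\ M - 1, \{2\}, (\{1\},\{2\})),\\
\quad (\mathit{fibonacci}(M1,N1), \{1\}, (\{1\},\{2\})), (M2\ \mathsf{is}\ M - 2, \{2\}, (\{1\},\{2\})),\\
\quad (\mathit{fibonacci}(M2,N2), \{1\}, (\{1\},\{2\})), (N\ \mathsf{is}\ N1 + N2, \{2\}, (\{1\},\{2\})))
\end{array}$$

so we know that, when the last call $N\ \mathsf{is}\ N1 + N2$ is performed, $N1 + N2$ is ground.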
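The groundness effect of prop in this example can be reproduced with a short Python sketch. It is a simplification, not a transcription of Definition 6: instead of calling entry for each atom, it accumulates the set of variables known to be ground from left to right, uses a hypothetical lookup table for the success patterns assumed above, treats a missing entry as "no new groundness", and ignores sharing.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def term_vars(t):
    if is_var(t):
        return {t}
    if isinstance(t, tuple):
        return set().union(*(term_vars(arg) for arg in t[1:]))
    return set()


def prop_groundness(query, success):
    """query: list of (atom, pi); success: (name, arity, frozenset(pi)) -> pi.
    Returns (atom, pi) pairs whose pi also includes positions made ground
    by the atoms to their left."""
    ground = set()  # variables known to be ground so far
    result = []
    for atom, pi in query:
        # call pattern: given positions plus those whose variables are ground
        call = set(pi) | {j for j in range(1, len(atom))
                          if term_vars(atom[j]) <= ground}
        result.append((atom, call))
        succ = success.get((atom[0], len(atom) - 1, frozenset(call)), call)
        for j in succ | call:
            ground |= term_vars(atom[j])
    return result


SUCCESS = {
    ("is", 2, frozenset({2})): {1, 2},
    ("fibonacci", 2, frozenset({1})): {1, 2},
}
FIB_QUERY = [
    ((">", "M", 1), {1, 2}),
    (("is", "M1", ("-", "M", 1)), {2}),
    (("fibonacci", "M1", "N1"), set()),
    (("is", "M2", ("-", "M", 2)), {2}),
    (("fibonacci", "M2", "N2"), set()),
    (("is", "N", ("+", "N1", "N2")), set()),
]

print([pi for _, pi in prop_groundness(FIB_QUERY, SUCCESS)])
# [{1, 2}, {2}, {1}, {2}, {1}, {2}]
```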

Before explaining the rules (parallel) and (unfolding), we still need one more auxiliary 
function, partition, which is used to check if a query contains some subgoals that 
can be executed in parallel (i.e., if they are strictly independent at run-time): 

⁷ Note the non-standard use of function entry to propagate success patterns to the right, despite the fact that $A_1 \leftarrow A_2, \ldots, A_m$ is not really a program clause.
Definition 7 (partition)

Let $(A,\pi,\mu)$ be an extended atom such that $(A,\pi,\mu) \leadsto Q$. We introduce the function $\mathit{partition}_{(A,\pi,\mu)}$ as follows:⁸

• $\mathit{partition}_{(A,\pi,\mu)}(Q) = (Q_1', Q_2'', Q_3'', Q_4'''')$ if $Q$ contains at least two extended atoms, $Q = Q_1, Q_2, Q_3, Q_4$, with $Q_2$ and $Q_3$ non-empty queries,
$(Q_1', (Q_2', Q_3', Q_4')) = \mathit{prop}(Q_1, (Q_2, Q_3, Q_4))$,
$Q_2'$ and $Q_3'$ are independent,
$(Q_2'', Q_4'') = \mathit{prop}(Q_2', Q_4')$, $(Q_3'', Q_4''') = \mathit{prop}(Q_3', Q_4'')$, and $Q_4'''' = \mathit{prop}(Q_4''', \mathit{true})$.

Here, strict independence of $Q_2'$ and $Q_3'$ is checked using the standard notion (see Section 1) and taking into account the groundness call patterns available from the extended atoms and the sharing call pattern for the head of the clause, i.e., the variables in $\mathit{Var}(Q_2') \cap \mathit{Var}(Q_3')$ must be ground according to the groundness call patterns in $Q_2'$, $Q_3'$, and each pair of different variables $(x,y) \in \mathit{Var}(Q_2') \times \mathit{Var}(Q_3')$ should not be shared in the head of the clause according to $\mu$.
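The independence test just described can be sketched as follows. This is a minimal Python version under two assumptions: groundness call patterns are sets of argument positions, and the sharing information of the head is given directly as a hypothetical set of variable pairs that may share (rather than as the positional pattern $\mu$). The data are the two Fibonacci subqueries with their propagated patterns.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def term_vars(t):
    if is_var(t):
        return {t}
    if isinstance(t, tuple):
        return set().union(*(term_vars(arg) for arg in t[1:]))
    return set()


def query_vars(query):
    return set().union(*(term_vars(atom) for atom, _ in query))


def ground_vars(query):
    """Variables occurring in argument positions marked ground in the query."""
    return set().union(*(term_vars(atom[j]) for atom, pi in query for j in pi))


def strictly_independent(q2, q3, head_sharing):
    """q2, q3: lists of (atom, pi); head_sharing: set of frozenset({x, y})."""
    common = query_vars(q2) & query_vars(q3)
    if not common <= (ground_vars(q2) | ground_vars(q3)):
        return False  # a shared variable is not known to be ground
    return all(frozenset((x, y)) not in head_sharing
               for x in query_vars(q2) for y in query_vars(q3) if x != y)


FIB_Q2 = [(("is", "M1", ("-", "M", 1)), {2}), (("fibonacci", "M1", "N1"), {1})]
FIB_Q3 = [(("is", "M2", ("-", "M", 2)), {2}), (("fibonacci", "M2", "N2"), {1})]

print(strictly_independent(FIB_Q2, FIB_Q3, set()))  # True
```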

Example 6

Consider again the Fibonacci program of Example 4 and the extended SLD resolution step of Example 5. By applying function partition to the derived extended query, we get

$$\begin{array}{l}
Q_1 = (M > 1, \{1,2\}, (\{1\},\{2\}))\\
Q_2 = (M1\ \mathsf{is}\ M - 1, \{2\}, (\{1\},\{2\})), (\mathit{fibonacci}(M1,N1), \{1\}, (\{1\},\{2\}))\\
Q_3 = (M2\ \mathsf{is}\ M - 2, \{2\}, (\{1\},\{2\})), (\mathit{fibonacci}(M2,N2), \{1\}, (\{1\},\{2\}))\\
Q_4 = (N\ \mathsf{is}\ N1 + N2, \{2\}, (\{1\},\{2\}))
\end{array}$$

which means that the queries $(M1\ \mathsf{is}\ M - 1, \mathit{fibonacci}(M1,N1))$ and $(M2\ \mathsf{is}\ M - 2, \mathit{fibonacci}(M2,N2))$ can be safely run in parallel at run time.
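The search performed by partition can be sketched as an enumeration of all split points $(i, j, k)$ with $Q_1$ = query[:i], $Q_2$ = query[i:j], $Q_3$ = query[j:k] and $Q_4$ = query[k:]. In this minimal Python sketch only groundness is considered, sharing is ignored, and prop is replaced by a simplified variable-based propagation that, as in Definition 7, lets only $Q_1$ propagate its success patterns before $Q_2$ and $Q_3$ are checked; the Fibonacci data and success table repeat the assumptions of Example 5. Choosing among several admissible splits is left open here.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def term_vars(t):
    if is_var(t):
        return {t}
    if isinstance(t, tuple):
        return set().union(*(term_vars(arg) for arg in t[1:]))
    return set()


def propagate_from(query, success, sources):
    """Groundness-only propagation in which just the first `sources` atoms
    (i.e., Q1) pass their success patterns to the atoms on their right."""
    ground, out = set(), []
    for idx, (atom, pi) in enumerate(query):
        call = set(pi) | {j for j in range(1, len(atom))
                          if term_vars(atom[j]) <= ground}
        out.append((atom, call))
        if idx < sources:
            succ = success.get((atom[0], len(atom) - 1, frozenset(call)), call)
            for j in succ | call:
                ground |= term_vars(atom[j])
    return out


def query_vars(query):
    return set().union(*(term_vars(atom) for atom, _ in query))


def ground_vars(query):
    return set().union(*(term_vars(atom[j]) for atom, pi in query for j in pi))


def candidate_partitions(query, success):
    """Yield (i, j, k) such that Q2 and Q3 are non-empty and independent."""
    n = len(query)
    for i in range(n):
        q = propagate_from(query, success, i)
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                q2, q3 = q[i:j], q[j:k]
                if (query_vars(q2) & query_vars(q3)) <= (ground_vars(q2) | ground_vars(q3)):
                    yield (i, j, k)


SUCCESS = {("is", 2, frozenset({2})): {1, 2}, ("fibonacci", 2, frozenset({1})): {1, 2}}
FIB_QUERY = [
    ((">", "M", 1), {1, 2}), (("is", "M1", ("-", "M", 1)), {2}),
    (("fibonacci", "M1", "N1"), set()), (("is", "M2", ("-", "M", 2)), {2}),
    (("fibonacci", "M2", "N2"), set()), (("is", "N", ("+", "N1", "N2")), set()),
]

splits = list(candidate_partitions(FIB_QUERY, SUCCESS))
print((1, 3, 5) in splits)  # True: the split of Example 6
```

A split such as (1, 2, 3), which would run M1 is M - 1 in parallel with fibonacci(M1, N1), is rejected because M1 is shared but not yet ground.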

Now, rules (parallel) and (unfolding) should be clear. When an atom is unfolded and the body of the selected clause can be run in parallel (which is determined by function partition), rule (parallel) applies. Note that we consider a simple algorithm where the atoms of a query cannot be reordered (i.e., we respect Prolog's computation rule). Of course, more elaborate strategies exist (see, e.g., (Muthukumar et al. 1999; Gras and Hermenegildo 2009)), but we consider them out of the scope of this paper.

When the body of the clause cannot be partitioned so that some subgoals are run 
in parallel, rule (unfolding) applies (which will give rise to a sequential clause, as we 
will see later). Here, we apply function prop in order to propagate groundness and 
sharing information to the extended atoms before they are split in the next step 
(since only the unfolding of atomic goals is considered).

In both rules, we add the selected extended atom to the set of already partially 
evaluated extended atoms. 

All transition rules are labelled with a letter that identifies the rule applied. This 
will become useful to generate residual rules (see the next section). 

⁸ In order not to encumber the notation, we assume that $Q_i'$ refers to the same extended query $Q_i$ after some processing.
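Example 7

Consider again the Fibonacci program of Example 4. Given the initial state

$$S_0 = \langle (\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}));\ \{\} \rangle$$

we have three partial evaluation derivations starting from $S_0$:

$$\begin{array}{l}
S_0 \longrightarrow_{\{A \mapsto 0,\, B \mapsto 1\}} \langle \epsilon;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\[1ex]
S_0 \longrightarrow_{\{A \mapsto 1,\, B \mapsto 1\}} \langle \epsilon;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\[1ex]
S_0 \longrightarrow_{\{A \mapsto M,\, B \mapsto N\}} \langle Q_1, Q_2, Q_3, Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle Q_2, Q_3, Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle (\mathit{fibonacci}(M1,N1), \{1\}, (\{1\},\{2\})), Q_3, Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle Q_3, Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle (\mathit{fibonacci}(M2,N2), \{1\}, (\{1\},\{2\})), Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle Q_4;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle\\
\quad \longrightarrow \langle \epsilon;\ \{(\mathit{fibonacci}(A,B), \{1\}, (\{1\},\{2\}))\} \rangle
\end{array}$$

Note that predicates not defined in the user's program (like > or is) are not unfoldable and that $Q_1, Q_2, Q_3, Q_4$ are the extended queries of Example 6.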



2.3 Post-Processing Stage 

Once the partial evaluation stage terminates, we produce renamed residual rules associated with the transitions of the partial evaluation semantics. In the following, we assume that there is a function $\mathit{ren}$ that takes an extended atom and returns a renamed atom whose predicate name is fresh and depends on the patterns of the extended atom. We do not present the details of this renaming function here since it is a standard renaming as introduced in, e.g., (Benkerimi and Lloyd 1989; De Schreye et al. 1999). For instance,

$$\mathit{ren}(\mathit{fibonacci}(X,Y), \{1\}, (\{1\},\{2\})) = \mathit{fibonacci\_1\_1\_2}(X,Y)$$

Note, however, that non-user predicates are not renamed, e.g.,

$$\mathit{ren}(M1\ \mathsf{is}\ M - 1, \{2\}, (\{1\},\{2\})) = M1\ \mathsf{is}\ M - 1$$
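Since the renaming function is not detailed here, the following Python sketch uses a purely hypothetical encoding (sorted positions concatenated, sharing blocks separated by underscores, "0" for an empty pattern), chosen only because it reproduces the name fibonacci_1_1_2 above. It is not collision-free in general (for instance, for predicates with ten or more arguments), so it should not be read as the scheme used by the tool.

```python
def ren(atom, pi, mu, user_preds):
    """Hypothetical renaming of an extended atom (atom, pi, mu).

    atom       -- tuple (name, arg1, ..., argN)
    user_preds -- set of (name, arity) pairs defined in the user's program
    Non-user calls are returned unchanged; user atoms get a predicate name
    encoding the groundness (pi) and sharing (mu) patterns.
    """
    name, arity = atom[0], len(atom) - 1
    if (name, arity) not in user_preds:
        return atom
    g = "".join(str(j) for j in sorted(pi)) or "0"
    s = "_".join("".join(str(j) for j in sorted(block)) for block in mu) or "0"
    return (f"{name}_{g}_{s}",) + tuple(atom[1:])


USER_PREDS = {("fibonacci", 2)}
print(ren(("fibonacci", "X", "Y"), {1}, ({1}, {2}), USER_PREDS))
# ('fibonacci_1_1_2', 'X', 'Y')
```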
The generation of residual rules proceeds as follows: 

• We do not generate residual clauses associated with the application of rules (variant) or (failure).

• For embedding steps of the form $\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \longrightarrow \langle Q;\ \mathit{memo} \rangle$ we produce a residual rule of the form $\mathit{ren}(A,\pi,\mu) \leftarrow A$. This means that some atoms will not be closed but defined in terms of calls to the original predicates (and, thus, the clauses of the original program should be added to the residual program).

• For nonuser steps $\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \longrightarrow \langle Q;\ \mathit{memo} \rangle$, we do not generate residual rules since non-user calls are not renamed.
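• For an unfolding step $\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \longrightarrow_{\sigma} \langle \mathit{prop}(Q', \mathit{true}), Q;\ \mathit{memo} \cup \{(A,\pi,\mu)\} \rangle$, we produce a residual rule of the form

$$\mathit{ren}(A\sigma,\pi,\mu) \leftarrow \mathit{ren}(B_1,\pi_1,\mu_1), \ldots, \mathit{ren}(B_n,\pi_n,\mu_n).$$

where $\mathit{prop}(Q', \mathit{true}) = ((B_1,\pi_1,\mu_1), \ldots, (B_n,\pi_n,\mu_n))$.

• Finally, for a parallel step $\langle (A,\pi,\mu), Q;\ \mathit{memo} \rangle \longrightarrow_{\sigma} \langle Q_1, Q_2, Q_3, Q_4, Q;\ \mathit{memo} \cup \{(A,\pi,\mu)\} \rangle$, we produce a residual rule of the form

$$\begin{array}{l}
\mathit{ren}(A\sigma,\pi,\mu) \leftarrow \mathit{ren}(B_1,\pi_1,\mu_1), \ldots, \mathit{ren}(B_n,\pi_n,\mu_n),\\
\quad (\mathit{ren}(B_{n+1},\pi_{n+1},\mu_{n+1}), \ldots, \mathit{ren}(B_m,\pi_m,\mu_m) \mathbin{\&} \mathit{ren}(B_{m+1},\pi_{m+1},\mu_{m+1}), \ldots, \mathit{ren}(B_k,\pi_k,\mu_k)),\\
\quad \mathit{ren}(B_{k+1},\pi_{k+1},\mu_{k+1}), \ldots, \mathit{ren}(B_l,\pi_l,\mu_l).
\end{array}$$

where

$$\begin{array}{l}
Q_1 = ((B_1,\pi_1,\mu_1), \ldots, (B_n,\pi_n,\mu_n)),\\
Q_2 = ((B_{n+1},\pi_{n+1},\mu_{n+1}), \ldots, (B_m,\pi_m,\mu_m)),\\
Q_3 = ((B_{m+1},\pi_{m+1},\mu_{m+1}), \ldots, (B_k,\pi_k,\mu_k)),\\
Q_4 = ((B_{k+1},\pi_{k+1},\mu_{k+1}), \ldots, (B_l,\pi_l,\mu_l)).
\end{array}$$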

Example 8

For instance, for the derivations of Example 7 we produce the following residual program:

    fibonacci_1_1_2(0, 1).
    fibonacci_1_1_2(1, 1).
    fibonacci_1_1_2(M, N) <- M > 1,
        (M1 is M - 1, fibonacci_1_1_2(M1, N1) & M2 is M - 2, fibonacci_1_1_2(M2, N2)),
        N is N1 + N2.
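Assembling the residual rules from the partitioned queries is plain string construction. The Python sketch below assumes that the goals have already been renamed by ren and that $\sigma$ has been applied to the head; it uses the `<-` and `&` notation of Example 8 (in the experimental section, parallel conjunctions are emitted as concurrent_k calls instead), and the function names are illustrative.

```python
def conj(goals):
    """Comma-separated conjunction of goal strings."""
    return ", ".join(goals)


def sequential_rule(head, body):
    """Residual rule for an unfolding step: head <- B1, ..., Bn."""
    return f"{head} <- {conj(body)}." if body else f"{head}."


def parallel_rule(head, q1, q2, q3, q4):
    """Residual rule for a parallel step: head <- Q1, (Q2 & Q3), Q4."""
    body = list(q1) + [f"({conj(q2)} & {conj(q3)})"] + list(q4)
    return f"{head} <- {conj(body)}."


print(sequential_rule("fibonacci_1_1_2(0, 1)", []))
print(parallel_rule("fibonacci_1_1_2(M, N)",
                    ["M > 1"],
                    ["M1 is M - 1", "fibonacci_1_1_2(M1, N1)"],
                    ["M2 is M - 2", "fibonacci_1_1_2(M2, N2)"],
                    ["N is N1 + N2"]))
```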

2.4 Correctness and Termination Issues 

The core of our new proposal mainly involves new control strategies, but the main procedure is still an instance of the standard partial evaluation framework, so its correctness should not be an issue. In particular, our partial evaluation scheme can be seen as an instance of the procedure of Benkerimi and Lloyd (1989), though in our case an atom is closed only if it is a variant (rather than an instance) of an already partially evaluated atom. Our approach is nevertheless correct, since we add calls to the predicates of the original program for non-closed atoms (and the residual program includes a copy of the original program clauses).

Regarding the termination of partial evaluation, this is a well-studied area, and the approach that we consider, based on the homeomorphic embedding ordering, is quite standard (Leuschel 2002).

Regarding the introduction of parallel conjunctions, in this paper we assume the correctness of the underlying groundness and dependency analyses. Moreover, we prove in the online appendix the correctness of the few functions introduced to propagate groundness and sharing patterns at partial evaluation time, entry and prop. Of course, the correctness of function partition can only be ensured when $Q_2$ and $Q_3$ only contain user-defined predicates or "safe" built-ins (i.e., built-ins without side effects, which neither depend on nor change the order of evaluation, etc.).

To summarize, this paper is not concerned with new theoretical developments regarding partial evaluation or program parallelization, but with the design of new control strategies that could allow us to improve existing partial evaluation techniques and use them to extract some implicit independent AND-parallelism of logic programs. Moreover, the proof-of-concept implementation of a parallelizing partial evaluator (that we discuss in the next section) shows that our approach is indeed viable in practice.

3 Experimental Evaluation 

A prototype implementation of the parallelizing partial evaluator described so far has been developed. It consists of approx. 1000 lines of SWI-Prolog code (including the groundness call and success pattern analysis, comments, etc.). The only missing component is the sharing analysis, which currently must be provided by the user. In general, built-ins and extra-logical features are not unfolded, though our tool includes information regarding the propagation of groundness and sharing information for them.

A web interface to our tool is available at http://kaz.dsic.upv.es/litep.html.

We have tested it by running some typical benchmarks from the literature on automatic independent AND-parallelization of logic programs (see, e.g., (Muthukumar et al. 1999; Gras and Hermenegildo 2009)):

• amatrix implements the addition of two matrices (a matrix is a list of lists);

• fib computes the well-known Fibonacci function; 

• flatten is used to flatten a list of lists of any nesting depth into a flat list;

• hanoi solves the Towers of Hanoi problem; 

• msort implements the mergesort algorithm on lists; 

• mmatrix implements the multiplication of two matrices; 

• palin recognizes (list) palindromes; 

• qsort implements the quicksort algorithm on lists; 

• tak computes the Takeuchi function. 

Moreover, in order to test the scalability of the tool, we have also applied our 
parallelizing partial evaluation tool to itself (ppeval). The code of the examples 
can be found in the tool's webpage. 

We use SWI-Prolog's concurrent/3 to run goals in parallel. Parallel processes in SWI-Prolog, however, are not lightweight. As mentioned in (SWI 2012), if the goals are CPU intensive and normally all succeeding, the number of CPUs is typically the optimal number of threads: fewer threads do not use all CPUs, while more threads waste time in context switches and also use more memory. For instance, the unbounded number of threads that would be created with the program of Example 8 would perform very badly even for small input values. In order to solve this problem, we replace calls to concurrent/3 by a special version as follows:

    :- dynamic current_threads/1.

    % concurrent_k(+SeqGoal, +ParGoal1, +ParGoal2): run ParGoal1 and ParGoal2
    % in parallel if fewer than K threads are active, otherwise run SeqGoal.
    % current_threads/1 (e.g., initialized to current_threads(0)) and
    % max_threads/1 must be defined before use; the counter updates below
    % are not synchronized between threads.
    concurrent_k(A,B,C) :-
        current_threads(N), max_threads(K), !,
        (   N < K
        ->  M is N+1,
            retractall(current_threads(_)), assert(current_threads(M)),
            concurrent(2, [B,C], []),
            current_threads(T), S is T-1,
            retractall(current_threads(_)), assert(current_threads(S))
        ;   call(A)
        ).

Table 1. Experimental evaluation of the parallelizing partial evaluator

    benchmark   Seq    Par1   Par2   Par4   Par6   Par8
    fib         1.00   1.00   1.83   2.88   3.82   3.70
    hanoi       1.00   1.27   1.54   2.29   2.05   1.97
    mmatrix     1.00   1.05   1.07   1.09   1.08   1.07
    palin       1.00   1.07   1.79   2.52   2.30   2.41
    tak         1.00   0.98   1.31   1.31   1.30   1.31
    amatrix     1.00   1.02   0.59   0.30   0.20   0.16
    flatten     1.00   1.23   0.72   0.63   0.61   0.81
    msort       1.00   1.59   0.86   1.23   1.22   1.26
    qsort       1.00   1.73   0.48   0.71   0.72   0.60
    ppeval      1.00   1.00   1.15   SO     SO     SO

Basically, given queries $Q_1$ and $Q_2$, $\mathit{concurrent\_k}((Q_1, Q_2), Q_1', Q_2')$ determines, depending on the current and maximum number of threads, whether the sequential goal $(Q_1, Q_2)$ or the parallel goal $Q_1' \mathbin{\&} Q_2'$ should be run (where $Q_i'$ is the parallel version of $Q_i$).
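The control logic of concurrent_k (a counter of active parallel conjunctions plus a sequential fallback) can be sketched in Python with the standard threading module. This is an illustration under stated assumptions, not a port of the tool: the counter is protected by a lock, the second goal runs in the calling thread rather than in a new one, exceptions raised in the worker thread are not propagated, and, because of CPython's global interpreter lock, CPU-bound goals will not actually run faster. `BoundedParallel` and `fib` are hypothetical names.

```python
import threading


class BoundedParallel:
    """Run two goals in parallel only while fewer than `max_threads` parallel
    conjunctions are active; otherwise run the sequential version instead."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.current = 0               # active parallel conjunctions
        self.lock = threading.Lock()   # protects `current`

    def run(self, seq, left, right):
        with self.lock:
            spawn = self.current < self.max_threads
            if spawn:
                self.current += 1
        if not spawn:
            return seq()               # like call(A) in concurrent_k
        try:
            results = [None, None]

            def worker():
                results[0] = left()

            t = threading.Thread(target=worker)
            t.start()
            try:
                results[1] = right()   # second goal runs in the calling thread
            finally:
                t.join()
            return results[0], results[1]
        finally:
            with self.lock:
                self.current -= 1


def fib(n, pool):
    """Fibonacci as in the residual program: fib(0) = fib(1) = 1."""
    if n <= 1:
        return 1
    a, b = pool.run(lambda: (fib(n - 1, pool), fib(n - 2, pool)),
                    lambda: fib(n - 1, pool),
                    lambda: fib(n - 2, pool))
    return a + b


pool = BoundedParallel(max_threads=4)
print(fib(15, pool))  # 987
```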

Table 1 summarizes our experimental results for the selected benchmarks. We executed SWI-Prolog (multi-threaded, 64 bits, version 6.0.2) on a 2.66 GHz Quad-Core Intel Xeon (with 8 GB 1066 MHz DDR3 RAM) running Mac OS X v10.7.3. Therefore, one can expect the best results for a maximum of 4 threads. Run times have been obtained using SWI-Prolog's get_time/1, which is similar to SICStus walltime and includes CPU time, garbage collection, etc. Rather than timings, we show the relative speedup (i.e., run time of the original program / run time of the residual program; values > 1 are then actual speedups) for each original program (column Seq) and its partially evaluated version using 1/2/4/6/8 cores (columns Par1/Par2/Par4/Par6/Par8). Here, SO indicates a stack overflow.
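Because every column of Table 1 is relative to the run time of the same original program, dividing a ParK value by Par1 gives t(Par1)/t(ParK), that is, the speedup contributed by the extra threads alone once the sequential effect of partial evaluation is factored out. A small Python sketch of this normalization over the values of Table 1 (ppeval is omitted because of the stack overflows; the helper names are illustrative):

```python
# Relative speedups from Table 1: Seq, Par1, Par2, Par4, Par6, Par8.
TABLE1 = {
    "fib":     [1.00, 1.00, 1.83, 2.88, 3.82, 3.70],
    "hanoi":   [1.00, 1.27, 1.54, 2.29, 2.05, 1.97],
    "mmatrix": [1.00, 1.05, 1.07, 1.09, 1.08, 1.07],
    "palin":   [1.00, 1.07, 1.79, 2.52, 2.30, 2.41],
    "tak":     [1.00, 0.98, 1.31, 1.31, 1.30, 1.31],
    "amatrix": [1.00, 1.02, 0.59, 0.30, 0.20, 0.16],
    "flatten": [1.00, 1.23, 0.72, 0.63, 0.61, 0.81],
    "msort":   [1.00, 1.59, 0.86, 1.23, 1.22, 1.26],
    "qsort":   [1.00, 1.73, 0.48, 0.71, 0.72, 0.60],
}


def speedup(t_original, t_residual):
    """Relative speedup as reported in Table 1 (> 1 means faster)."""
    return t_original / t_residual


def gain_over_par1(row):
    """Par2..Par8 divided by Par1, rounded to two decimals."""
    par1 = row[1]
    return [round(v / par1, 2) for v in row[2:]]


for name, row in TABLE1.items():
    print(f"{name:8} {gain_over_par1(row)}")
```

For example, this yields values between 0.28 and 0.42 for qsort and between 0.54 and 0.79 for msort, i.e., for these programs the sequential partially evaluated version is faster than any parallel one.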

First, we observe that the values of column Par1 are not always 1.00. This is due to the effects of the partial evaluation. We tried to minimize them, but for some examples they still have a significant effect. The first group of benchmarks (fib, hanoi, mmatrix, palin and tak) shows the expected results: Par1 is generally close to 1 and the introduction of parallel threads produces speedups, ranging from modest (mmatrix, at most 1.09) to substantial (fib, up to 3.82). For the second group of benchmarks (amatrix, flatten, msort and qsort), we get a slowdown in almost all parallel configurations, the exception being msort with 4 or more threads (and, even in this case, the sequential partially evaluated version is faster). Let us take a look at the results. For instance, for amatrix, we transform:
    amatrix([L1|O1], [L2|O2], [L3|O3]) :- am1(L1,L2,L3), amatrix(O1,O2,O3).

into

    amatrix_par([A|B], [C|D], [E|F]) :-
        concurrent_k((am1(A,C,E), amatrix(B,D,F)),
                     am1(A,C,E),
                     amatrix_par(B,D,F)).

and leave the rest of the program untouched. For quicksort, we get

    quicksort_par([A|B], C) :-
        partition(B, A, D, E),
        concurrent_k((quicksort(D,F), quicksort(E,G)),
                     quicksort_par(D,F),
                     quicksort_par(E,G)),
        append(F, A, G, C).

and the rest of the program is not modified. Similar results are obtained for flatten and msort. Note that the output of our tool is perfectly reasonable (i.e., it coincides with a typical parallelization by hand). So what explains the slowdowns? Besides the particularities of these benchmarks, they might be caused by the model of parallel threads implemented in SWI-Prolog (which copies ground arguments instead of sharing them). Further investigating this point is a subject of ongoing research; e.g., we plan to test the benchmarks using a different Prolog environment supporting source-level primitives for AND-parallelism. As for the third group, ppeval, we do not get a significant speedup, but it allows us to check that the approach is viable in practice and scales up well to medium-sized programs (the stack overflow corresponds to running the specialized partial evaluator to partially evaluate itself on 4 or more threads, and seems to be related to the limited size of the threads' stacks, i.e., it is not a fault of ppeval).

In summary, the experimental evaluation is still preliminary, but it clearly shows that there is good potential for improving program performance by using a parallelizing partial evaluator. Indeed, one can easily judge by visual inspection of the annotated programs (check the results in http://kaz.dsic.upv.es/litep.html) that our parallelizing partial evaluator uncovers as many parallelism opportunities as possible. We have not yet compared our tool with any existing parallelizing compiler for logic programs, for two reasons: our tool is not yet mature enough to deal with realistic Prolog applications, and we could not find a publicly available working system for source-level program parallelization.

4 Concluding Remarks and Future Work 

In this work, we have presented a novel approach to parallelizing partial evaluation. 
Analogously to standard approaches to automatic independent AND-parallelization 
of logic programs, our partial evaluator uses run-time groundness and dependency 
information. However, in contrast to these approaches, we can transform the source 
program in order to expose more opportunities for parallelization. We are not 
aware of any previous proposal along the same lines. (Consel and Danvy 1992
quite a different goal from ours. The closest approach we are aware of is that of (Surati and Berlin 1994), where a standard partial evaluator is used to expose some low-level operations of a program so that a parallelization algorithm can be more successfully applied. They consider, however, two independent actions, standard partial evaluation and program parallelization, in contrast to our approach. Nevertheless, the idea of combining partial evaluation and static analysis is not new (Jones 1997). Also, partial evaluation of an instrumented interpreter can be used to enrich source programs with additional information that is useful for debugging or optimizing execution (see, e.g., (Pebois 2004; Jones 2004)). Although we are not aware of its use for generating annotations for parallelism so far, partially evaluating an interpreter instrumented with groundness and sharing information (so that conjunctions are executed in parallel when safe) could obtain results similar to those of our approach.

Since ours is a novel approach, we consider that there is plenty of room for further improvements. Firstly, one can consider the use of more accurate groundness and sharing analyses. Secondly, our partition procedure to extract two independent subgoals that can be run in parallel is rather simple. We plan to extend it to allow an arbitrary number of parallel subgoals, and also to allow the reordering of some subgoals. We would also like to explore other notions of AND-parallelism like non-strict independent AND-parallelism or, even, dependent AND-parallelism. Finally, the combination of our approach with a more aggressive partial evaluation scheme is also an interesting avenue for future work.